(54) Title: METHOD TO FIND DISEASE-ASSOCIATED SNPS AND GENES

(57) Abstract: A way of identifying disease-associated genes, and their
mis-regulation, has been developed. This is accomplished by: 1) analysis of
2-3 kb upstream of open reading frames to identify promoter SNPs likely to be
"functional"; 2) identifying SNPs within transcription factor clusters
("TFCs"), which appear to be located just about anywhere in relation to the
gene(s) they regulate (5' or 3', at varying distance); and 3) identification
of Alu sequences to find presence-or-absence polymorphisms. By identifying
SNPs that are located in the promoter region, one may easily identify the gene
that is regulated by the SNP-harboring sequence and reasonably deduce that the
gene product (or an abnormal level of the product) is somehow involved in the
disease at hand. Comparison and analysis may be carried out with the sequences
available in the databases identified in the provisional. The number of
"typings" is significantly reduced by comparing only those sequences that are
associated with already identified and interesting genes (hypertension,
endocrinology, and others with known SNPs in the promoters). "Heath chips",
which contain many different sequences of interest, can be used for screening
of patient or control samples, to generate profiles of disease-associated
markers and risk of disease in an individual or population of individuals.
These can also be used for drug design and testing.

METHOD TO FIND DISEASE-ASSOCIATED SNPs AND GENES

Background of the Invention 

The present invention is generally in the field of identifying potential
DNA, RNA, or protein targets for drug therapy or diagnostics.

Each gene in the genome codes for a separate protein, although it is 
possible that a single gene might code for several variants of the same 
protein. The protein is the actual work-horse in the body; the protein enables 
the cell, the tissue, the organ, and, ultimately, the organism, to live. The 
genes can be thought of as the instructions, or the blueprints, for life.

Human beings have only about 30,000 separate genes in their
genome; roundworms have close to 20,000. With 40% of human genes
having a counterpart in the fruit fly or the worm, it is clear that a human
being is not that different from other organisms. If humans share the same
building blocks, or proteins, as other species, and these building blocks have
not changed for hundreds of millions of years, then what makes us human is not
in the building blocks themselves. Why a human being, instead of a fruit fly
or a worm?

The answer is familiar to any child who plays with blocks. Starting
with the same building blocks, a child knows that many different buildings
and even cities can be constructed. What matters is the order in which the
building blocks are used. Two large blocks followed by a small block will
create a very different structure than two small blocks followed by a large
block. In terms of genes, this translates to when the gene gets turned on or
off, i.e. how the gene is regulated. When it is on, the gene makes a message
which can be translated into a protein; when it is off, no new message can be
made. Turning on genes, which themselves have been highly conserved over
hundreds of millions of years, in a slightly different order marks the
difference between one species and a new one.

How a gene is regulated, like the product of the gene, is contained in
the DNA sequence itself. DNA is similar to an instruction book that says not
only how to construct a bicycle but also contains the instructions for which
birthday to make it for. All of this is contained in the string of letters in
the DNA sequence: A's, G's, C's, and T's, where each letter stands for a
different base. Remarkably, any two people differ, on average, at only one
letter out of every 1,000. Thus, at a given spot, one person might have a C
whereas another person might have a T. But all the letters on either side of
this spot will be the same, until the next difference, roughly 1,000 letters
away. These relatively few differences between people, or variants, are
called "polymorphisms," and single base (or nucleotide) differences are
referred to as "single nucleotide polymorphisms." The acronym for this is
"SNP" (pronounced "snip").

The reason why one person dies of a heart attack at age 45, say, and
another person dies of colon cancer at age 63 involves, to a large extent, the
difference in the letters between them. Since the human genome contains 3.3
billion positions, there are actually about 3 million differences between
these two people.

There are currently several approaches to finding the genes which
cause disease. The oldest, or "classical" genetics, approach is to use the
variations among the DNA letters as markers. A map of 1.4 million SNPs
has been created across the entire human genome for use as markers. It is
estimated that at least 300,000 markers, spaced every 10,000 letters, will be
required. Since detecting each marker currently costs at least $1, scanning a
single patient would cost $300,000, an unreasonable amount.

A second approach focuses on SNPs that could make a difference in
how the protein actually functions. These polymorphisms occur in the coding
sequence of the gene, and are called "coding region SNPs" or "cSNPs".
Since each amino acid is encoded by a triplet of three letters (the "codon"),
changing one of the three letters, say from a C to a T, might result in a new
amino acid being read into the protein instead of the usual one. Many letter
changes, especially in the third or "wobble" position, make no difference in
the amino acid that is read out. These are called synonymous cSNPs. The
SNPs which alter the amino acid are usually in the first or second position of
the codon, or triplet of bases; these are called non-synonymous SNPs.
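
The distinction between synonymous and non-synonymous cSNPs can be illustrated
with a minimal Python sketch. The partial codon table uses entries from the
standard genetic code; the function name classify_csnp and the example codon
are illustrative assumptions and are not part of the method described here.

```python
# Partial codon table (standard genetic code); only the codons used below.
CODON_TABLE = {
    "GCC": "Ala",
    "GCT": "Ala",  # third ("wobble") position changed: still alanine
    "GTC": "Val",  # second position changed
    "ACC": "Thr",  # first position changed
}


def classify_csnp(codon, position, new_base):
    """Classify a single-base change in a codon.

    position is 0, 1, or 2 (first, second, or wobble position).
    Both the original and the mutated codon must be in CODON_TABLE.
    """
    mutated = codon[:position] + new_base + codon[position + 1:]
    if CODON_TABLE[codon] == CODON_TABLE[mutated]:
        return "synonymous"
    return "non-synonymous"


wobble_change = classify_csnp("GCC", 2, "T")   # GCC -> GCT
second_change = classify_csnp("GCC", 1, "T")   # GCC -> GTC
first_change = classify_csnp("GCC", 0, "A")    # GCC -> ACC
```

In this example the wobble-position change leaves alanine unchanged, while the
changes at the first and second positions substitute a different amino acid.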

It has been possible for over two years now to mine publicly
available databases, such as the EST database, to find coding SNPs. A
number of pharmaceutical and biotechnology companies are using cSNPs to
try to find disease-associated genes.

However, there is no sense in using SNPs as markers, since genetic
epidemiologists claim that you have to use over 300,000 of them for each
patient, and this costs too much. Functional cSNPs, i.e. non-synonymous
SNPs, make little biological sense. How could a protein that is the same in
humans as in the mouse, i.e. that has not changed its amino acids in over 70
million years, suddenly sprout amino acid changes in humans? It might
happen to one person in several billion, but it certainly would not explain
why two-thirds of Americans die from heart disease and one-third die from
cancer.

Regulatory sequences, which determine when the gene is turned on,
have increasingly been a target of investigation. This area of investigation
has recently been termed "regulonomics". There are various levels of
regulation, like the floors in a house. The first floor, or level, involves how
much the gene is transcribed (i.e. how much messenger RNA is made from the
gene's DNA sequence). There are additional levels of regulation, such as
how much of the messenger RNA is converted into protein (or "translated"),
how long the protein lives in the cell before it is broken down, how active the
protein itself is, etc. The DNA sequences which control the first level (i.e.,
how much RNA is made, or "transcribed," from a particular gene) are fairly
well known by now, although there is more work to be done. The DNA
sequences for all subsequent levels are only poorly understood now, if at all.

There are currently two major approaches to finding
disease-predisposition genes: linkage disequilibrium (LD) and association.

Linkage disequilibrium (LD) is the method of "classical" genetics. It
involves using DNA samples from families, and neutral polymorphisms or
"markers" spaced throughout the genome. Genetic statistics are used to find
those markers which segregate with the disease. LD works extremely well
with single gene diseases, such as hemochromatosis. But so far it has been
quite disappointing for common adult diseases caused by multiple genes,
each of which contributes less than 5% to causing the disease. One reason is
that not enough markers are currently available.

The advantage of the LD method is that it allows for a whole-genome
search. Thanks to the efforts of the SNP Consortium, markers (in the form of
single nucleotide polymorphisms, or "SNPs") are now available throughout
the entire genome. Unfortunately, families cannot be used for serious adult
diseases because they are usually age-dependent and by definition (given the
limitations of current medicine) occur in the last 5-10 years of a patient's
life. By this time, a patient's siblings and parents are not available to provide
their genomic DNA for a variety of reasons: if affected by the same disease,
they would have died already; and, even if unaffected, they would not live
nearby. (Isolated populations, such as the New World Amish or Icelanders,
are an exception to the geographic dispersion rule.)

Unrelated patient populations must be used instead. For unrelated
individuals, markers must be spaced much more closely than for family
members. As a result, each patient's DNA must be scanned for at least
300,000 markers (that is, a marker every 10,000 letters, or nucleotides) in
order not to miss any disease-associated regions in the genome, especially if
a region contributes only a little towards the disease (i.e. <5%). Also,
because many genes (perhaps as many as 50) can cause the disease, and the
disease may require only a subset of the 50 causative loci to manifest itself,
hundreds if not thousands of patients must be genotyped to get a complete
idea of how many combinations of loci are at work. The combinations of
loci also will vary from one ethnic group to another, depending on the
genetic closeness of the ethnic group. Caucasians, Chinese, and Amerindians
will in general share more disease loci than people of African ancestry, since
the African population is far older (1-2 million years old vs. 100,000 years or
less) and more genetically heterogeneous than the former groups.

At $1 a genotype, the cost of performing whole-genome scans on
several hundred patients, and an equal number of controls, is astronomical.
For example, for 300 cases and 300 controls, solving a single disease by
linkage disequilibrium would cost at least $300,000 x 600 = $180 million
for genotyping alone. A second disease would cost an additional $180
million. And some genetic epidemiologists think that at least 500,000
markers will be required, for an average spacing of 6,000 nucleotides
between markers.
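
The cost arithmetic above can be reproduced with a short Python sketch. The
function names are illustrative, and the genome length of 3 billion
nucleotides is an assumption chosen because it matches the marker spacings
quoted in the text (10,000 and 6,000 nucleotides).

```python
def genome_scan_cost(markers, cases, controls, cost_per_genotype=1.0):
    """Total genotyping cost in dollars for a case-control genome scan."""
    return markers * (cases + controls) * cost_per_genotype


def marker_spacing(genome_length, markers):
    """Average number of nucleotides between evenly spaced markers."""
    return genome_length / markers


GENOME_LENGTH = 3_000_000_000  # assumed round figure, see lead-in

cost_300k = genome_scan_cost(300_000, cases=300, controls=300)
cost_500k = genome_scan_cost(500_000, cases=300, controls=300)
spacing_300k = marker_spacing(GENOME_LENGTH, 300_000)
spacing_500k = marker_spacing(GENOME_LENGTH, 500_000)
```

With 300,000 markers the sketch gives the $180 million figure quoted above;
raising the marker count to 500,000 under the same assumptions raises the cost
to $300 million per disease.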

The second method of finding disease genes is the association study.
Patients ("cases") and controls (healthy people, i.e. "super-controls") are
compared for the frequency of a given version of a gene ("allele").
Super-controls, such as plasma donors obtained through Interstate Blood Bank
(Memphis, TN), are used because it is not known a priori which diseases are
caused by the same gene, making the use of patients with a second disease
unsuitable as a control group.

For example, let us say that a particular position within a gene is
polymorphic, and exists either as a "C" or a "T" in the population. Then an
association study would determine the frequency of "C's" and "T's" among
cases and controls. If the frequency of the "C" allele was 40% among
patients for a given disease, but only 10% among controls, and this
difference was statistically significant, then the "C" allele would be said to
be associated with the disease.
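
One common way to test whether such a difference in allele frequency is
statistically significant is a chi-square test on a 2 x 2 table of allele
counts. The sketch below is an illustration only: the text does not specify a
test, and the cohort size of 100 cases and 100 controls (200 alleles per
group) is an assumption made to turn the 40% and 10% frequencies into counts.

```python
import math


def allele_chi_square(case_c, case_t, control_c, control_t):
    """Pearson chi-square (1 degree of freedom, no continuity correction)
    for a 2 x 2 table of allele counts. Returns (chi2, p_value)."""
    a, b, c, d = case_c, case_t, control_c, control_t
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2/2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Assumed cohort: 100 cases and 100 controls, i.e. 200 alleles per group.
# 40% "C" among cases -> 80 C, 120 T; 10% "C" among controls -> 20 C, 180 T.
chi2, p_value = allele_chi_square(80, 120, 20, 180)
```

Under these assumed counts the statistic is 48 and the p value falls far below
either of the thresholds (0.05 or 10^-4) discussed in this section.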

The case-control, or association, method is sensitive to small
contributions by individual genes, which is highly desirable when perhaps 50
genes are involved in causing disease in a given population. But the
disadvantage of the case-control method, until the present method, is that it
required first guessing which gene is involved with the disease. The problem
with a "candidate gene" approach is that too little of the genomic anatomy of a
disease is known to be able to guess which 50 genes might be involved with
any accuracy. Furthermore, the case-control method is subject to false
positive results. Should the threshold probability value "p" be 0.05, or as low
as 10^-4 as claimed by some (Neil Risch, Science, 1996)? If multiple SNPs
are tested simultaneously, the statistical problem of correction for repetitive
testing cannot be solved.

It is therefore an object of the present invention to provide a
cost-effective method and means for analysis of regulatory sequences.

It is a further object of the present invention to provide a method and
means for determining what markers or changes in regulatory sequences may
be associated with specific diseases.

Summary of the Invention 
A way of identifying disease-associated genes, and their
mis-regulation, has been developed. This is accomplished by:

1) Analysis of 2-3 kb upstream of open reading frames to identify
"functional" SNPs (this eliminates the class of SNPs that are a result of a
change in the "wobble" position of the ORF, which are not very interesting
because the amino acid sequence of the protein remains unchanged).
Functional SNPs are more likely to be found in this scenario because
transcription factors are very sensitive to nucleotide changes in the sequence
that they recognize for binding.

2) Comparing transcription factor clusters ("TFCs") and identifying
SNPs within these clusters. It appears that these TFCs can be located just
about anywhere in relation to the gene(s) they regulate (5' or 3' with varying
distance).

3) Identifying Alu sequences. It appears that these are human-like
transposons that can jump around via a recombination mechanism and
interrupt whatever sequence they insert into. These sequences may form
tRNA-like structures, severely inhibiting the binding of any transcription
factors that bind in or around the area. This Alu retroposon sequence is known.

By identifying SNPs that are located in the promoter region, one may
easily identify the gene that is regulated by the SNP-harboring sequence and
reasonably deduce that the gene product (or an abnormal level of the
product) is somehow involved in the disease at hand. Comparison and
analysis may be carried out with the sequences available in the databases
identified in the provisional. The number of "typings" is significantly
reduced by comparing only those sequences that are associated with already
identified and interesting genes (hypertension, endocrinology, and others
with known SNPs in the promoters).

"Heath chips", which contain many different sequences of interest, can
be used for screening of patient or control samples, to generate profiles of
disease-associated markers and risk of disease in an individual or population
of individuals. These can also be used for drug design and testing.

Detailed Description of the Invention 
A method focusing on polymorphisms in the regulatory regions of
genes that cause the majority of diseases has been developed for use in
diagnostic techniques and to assist in the design of drugs targeted to specific
diseases. This method combines the whole-genome inclusiveness of LD
with the sensitivity and simplicity of association studies. Rather than using
SNPs as "markers," as LD does, this method uses SNPs which themselves
could be the cause of disease, i.e. are "functional." These SNPs are taken from
the region of the gene that controls its expression ("transcription"). A single
letter difference in a transcription factor binding site could make the
difference between a site which binds a transcription factor tightly versus
loosely.

Whole genome coverage is obtained in two ways: by looking at
promoters and transcription factor clusters (TFCs). A "promoter" is defined
as the stretch of DNA to the left (i.e. upstream or 5') of the gene itself. In
about half of genes, it is upstream (5') to a TATA box, although the other
half of genes do not have a recognizable TATA box. The number of DNA
letters that constitutes the promoter is ill-defined, but 3,000 bases upstream
(5') of the start site for transcription is a reasonable upper limit in practice.
There are software programs available for identifying open reading frames
(i.e. genes) as well as the transcription start site. The relevant 3 kb of the 5'
region can be easily deduced when the raw sequence is known (as is the case
for 90% of the genome currently).

The second way of including transcriptionally active regulatory sites
from throughout the entire genome is to use transcription factor clusters.
TFCs were recently described by David States and his group at Washington
University in U.S.S.N. 20020027519 published March 28, 2002, entitled
"Identifying clusters of transcriptional factor binding sites". TFCs are
clusters of transcription factor binding sites, occurring in groups of four or
more. What makes them likely to be involved in transcription is that the total
number of TFCs (about 40,000-50,000) corresponds closely to the total
number of genes in the human genome (about 30,000-40,000). It is
extremely unlikely that these clusters occurred simply by chance. Thus, it
seems that there is close to a one-to-one correspondence between TFCs and
genes. Focusing on TFCs should net the entire genome, and provide the
whole-genome coverage required to find most disease-associated alleles.

SNPs in promoter (5') regions and TFCs can be determined most
easily using the public human genome and SNP databases. To find promoter
SNPs, 5' untranscribed regions can be obtained by standard bioinformatics
methods from the genome and stored as a file. This file of 5' regions can
then be compared against the public SNP database (dbSNP). It is estimated
that a total of 50,000 "promoter" SNPs might be obtained this way. Perhaps
an additional number (up to 90,000) could be obtained from a more complete
SNP database such as privately held ones, e.g. Celera's 2.4 million SNPs. Of
course, additional SNPs could be identified directly by PCR amplification of
5' regions and sequencing of a number of individuals (e.g. a mixture of 96
African Americans, Caucasians, and Chinese).
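
The comparison of a file of 5' regions against a SNP database amounts to an
interval lookup. The following Python sketch shows one way this could be done,
assuming both data sets have already been reduced to simple tuples; the record
layouts, gene names, and SNP identifiers are hypothetical and do not reflect
the actual dbSNP or genome file formats.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def promoter_snps(promoters, snps):
    """Return {gene: [snp_id, ...]} for SNPs falling inside each promoter.

    promoters: iterable of (chromosome, start, end, gene), inclusive coordinates
    snps:      iterable of (chromosome, position, snp_id)
    """
    by_chrom = defaultdict(list)
    for chrom, pos, snp_id in snps:
        by_chrom[chrom].append((pos, snp_id))

    # Sort SNPs once per chromosome so each promoter is a binary search.
    index = {}
    for chrom, entries in by_chrom.items():
        entries.sort()
        index[chrom] = ([p for p, _ in entries], [s for _, s in entries])

    hits = {}
    for chrom, start, end, gene in promoters:
        positions, ids = index.get(chrom, ([], []))
        lo = bisect_left(positions, start)
        hi = bisect_right(positions, end)
        hits[gene] = ids[lo:hi]
    return hits


# Hypothetical records: 3 kb promoter windows and four SNPs.
example_promoters = [("chr1", 1000, 3999, "GENE_A"), ("chr2", 500, 3499, "GENE_B")]
example_snps = [("chr1", 1500, "snp1"), ("chr1", 4000, "snp2"),
                ("chr2", 3499, "snp3"), ("chr3", 10, "snp4")]
result = promoter_snps(example_promoters, example_snps)
```

The same lookup would apply to TFC intervals in place of promoter windows.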
Promoter (5' region) SNPs

Ideally, the entire human genome would be annotated, and every 5'
region of every gene already known. Then, approximately 2 kb of each 5'
region would be examined for overlap with the public SNP database, dbSNP.
The intersection of the two databases would yield a whole genome list of 5'
region (promoter) SNPs. These would be placed on a microarray ("chip") for
ultra-high throughput genotyping as described below.

Practically speaking, however, the entire human genome is not yet
annotated, nor is every 5' region yet known. Even if it were, the collection of
promoter SNPs derived from the entire genome will be large and
cumbersome. At an average occurrence of 1 SNP per 500 base pairs, 4 SNPs
are expected in a 5' region (promoter) 2 kb in length. For an estimated
35,000 genes, this amounts to 140,000 SNPs. Performing 5,000 SNP typings
on a single glass slide ("chip") by primer extension is the current state of the
art. But using anything less than 140,000 SNPs means less than a whole
genome scan. Finding disease genes is like fishing for elusive fish: the wider
the net, the higher the probability of success. A strategy for ordering
promoter SNPs is therefore required in order to maximize the chances for
"catching" disease genes in a net of finite size.

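The estimate of 140,000 promoter SNPs, and the number of chips it implies at
5,000 typings per slide, can be checked with a few lines of Python. The
function names are illustrative; the chip count is simply derived from the
figures quoted in the preceding paragraph and is not stated in the text.

```python
import math


def expected_promoter_snps(genes, promoter_bp, bp_per_snp):
    """Expected SNP count across all promoters at a uniform SNP density."""
    return genes * promoter_bp // bp_per_snp


def chips_needed(total_snps, snps_per_chip):
    """Number of chips required to type every SNP once per sample."""
    return math.ceil(total_snps / snps_per_chip)


snps_per_promoter = expected_promoter_snps(1, 2_000, 500)
total_snps = expected_promoter_snps(35_000, 2_000, 500)
chips = chips_needed(total_snps, 5_000)
```

Under these figures a full promoter scan would need 28 chips per sample, which
is why an ordered list of candidate genes is proposed next.
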
Essentially, this reduces to the problem of drawing up a list of
candidate genes. The following lists are proposed:

1. 75 hypertension candidate genes. Reference: Nature Genetics, July,
1999. Vol. 22(3): 239-247. PMID (PubMed ID No.): 10391210.

2. 106 candidate genes for hypertension and endocrinology. Reference:
Nature Genetics, July, 1999. Vol. 22(3): 231-238. PMID: 10391209.

3. Approximately 700 genes selected by the author (see Appendix).

4. 1031 genes, in which promoter SNPs have already been found.
Reference: Genome Research, May, 2001. Vol. 11(5): 677-684. GenBank
Accession Numbers AU098358-AU100608.

5. Online Mendelian Inheritance in Man (OMIM). As of today, OMIM
consists of approximately 9,700 genes, including 37 mitochondrial genes.
Reference: http://www.ncbi.nlm.nih.gov/entrez/Omim/mimstats.html.

The advantages of using OMIM as a list of candidate genes are as 
follows: 

(A) Every gene in OMIM is already associated with a disease phenotype.
This increases the likelihood that dysregulation of any of these genes because
of one or more regulatory polymorphisms will also result in a disease
phenotype.

(B) The number, almost 10,000, represents about one-third of the entire
human genome. Thus, it should net at least one-third of all disease genes.

SNPs can be discovered in silico by searching for the intersection of
the candidate genes with dbSNP, or in vitro by amplification and direct
sequencing of at least 10 individuals (20 chromosomes) to detect alleles
present at 5% frequency in the population.
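
How reliably a sample of 20 chromosomes picks up a 5% allele can be estimated
with a simple probability calculation. The sketch below assumes chromosomes are
sampled independently from the population and that a single observed copy
counts as detection; both are simplifying assumptions and are not stated in the
text.

```python
import math


def detection_probability(allele_freq, individuals):
    """Chance that an allele appears at least once among 2 * individuals
    independently sampled chromosomes."""
    chromosomes = 2 * individuals
    return 1.0 - (1.0 - allele_freq) ** chromosomes


def chromosomes_needed(allele_freq, power):
    """Smallest number of chromosomes giving at least the requested
    probability of seeing the allele once."""
    return math.ceil(math.log(1.0 - power) / math.log(1.0 - allele_freq))


p_detect = detection_probability(0.05, individuals=10)
n_for_95 = chromosomes_needed(0.05, power=0.95)
```

Under these assumptions, 10 individuals detect a 5% allele roughly 64% of the
time, and about 59 chromosomes (30 individuals) would be needed to reach a 95%
chance of detection.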

Alu insertion/deletion polymorphisms 

Ninety-five percent of the genome consists of intergenic DNA. This
vast tract of DNA is ignored for now. Regulatory polymorphisms will
instead be sought within genes first, in 5' untranscribed regions (promoters),
3' untranslated regions, and introns.

Introns themselves can be much larger than the exonic portion of a
gene. Apart from splicing site polymorphisms which control whether exons
are correctly spliced together, little is known about how intronic
polymorphisms affect the rate of transcription or splicing. An exception is
the insertion/deletion polymorphism involving Alu sequences.

Alu sequences consist of about 300 base pairs, and represent two
transfer RNA molecules held together by an approximately 25 base-long
"necklace." The bases of the "necklace" are highly variable, but their number
is not. The two tRNA molecules in an Alu sequence resemble the tRNA for
lysine most closely. Alu's support transcription by RNA polymerase III, the
same enzyme used for transcription of tRNAs. Alu's are called retroposons
since they can integrate into DNA. Indeed, 5% of human DNA consists of
Alu sequences. The ability of Alu's to integrate into DNA may be due to the
affinity of recombination enzymes for the Alu sequence. Indeed, one
possibility for why Alu's occur so frequently is that they might act like
"tabs" to align sister chromatids during meiotic recombination.

In 1990, the angiotensin I-converting enzyme (ACE) gene was found
to have an Alu sequence inserted into intron 16 with a frequency of about
50% in Caucasians. The frequency of this Alu insertion allele is lower among
Africans, e.g. 33% among Nigerians, and higher among Asians, e.g. 90%
among Japanese and Chinese.

The Alu deletion allele is associated with an approximately twice
higher rate of transcription of ACE than the insertion allele. Electron
microscopy shows that the Alu in intron 16 forms a cruciform structure.
When nucleoplasm is poured over a column containing Alu sequences
covalently linked to beads, a number of recombinase enzymes and other
nuclear proteins are bound. The Alu sequence may represent an archaic form
of RNA from "The RNA World" which was optimized for interactions with
nuclear proteins and nucleic acids.

It is therefore likely that any Alu occurring in an intron will delay
transcription of the gene it is located in, in the same way as the Alu occurring
in intron 16 of some versions of the ACE gene. It is also possible that an Alu
occurring in the 5' region of a gene may interfere with the assembly of
transcriptional complexes nearby due to the severe tRNA-like secondary
structure which Alu sequences adopt. As a result, the "deletion" variant of an
Alu insertion/deletion polymorphism is expected to have higher gene
expression than the "insertion" allele. If the gene causes disease, then the
deletion allele is expected to be associated with the disease.

Similarly, the occurrence of an Alu sequence in the 3' region of the
gene may conceivably affect stability or the rate of processing of messenger
RNA; no such Alu sequences have yet been described.

A rapid method to screen non-coding regions of genes (introns and
5' regions) for Alu polymorphisms is as follows:

1. Examine GenBank for annotated genes. Locate Alu sequences in the
annotated portion of the 5' region or intronic sequence.

2. To see if there is a population polymorphism at the 5% level, take
genomic DNA from 10 individuals of a given ethnicity, constituting 20
copies of the autosomal genes (except for rDNA genes). Design primers to
amplify ~600 bases including the Alu from each sample at each location in
the genome, using PCR or another suitable amplification method (e.g.
rolling circle amplification).

3. The samples can be analyzed in separate lanes, or pooled and run in a
single lane for efficiency. The presence of an Alu polymorphism will be
indicated by the appearance of a band of approximately 300 nucleotides after
standard agarose gel electrophoresis.

4. Genotyping can be performed in the same manner, using PCR
amplification followed by agarose gel electrophoresis. Other genotyping
methods can be used, such as hybridization.

5. Transcribed Alu sequences in the 3' region of genes may be identified
by performing a BLAST search of the EST database using a consensus
Alu sequence. Polymorphisms can be detected by aligning multiple readings
of the same 3' region.
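
Steps 3 and 4 above can be pictured as a simple genotype call from the band
sizes seen in one individual's lane. The following Python sketch is only an
illustration: the function name, the genotype labels, and the 50-nucleotide
sizing tolerance are assumptions, while the ~600-base amplicon and ~300-base
Alu come from the steps above.

```python
def call_alu_genotype(band_sizes, amplicon=600, alu=300, tolerance=50):
    """Call an Alu insertion/deletion genotype from gel band sizes (in
    nucleotides) observed for a single individual.

    A band near `amplicon` indicates the insertion allele; a band near
    `amplicon - alu` indicates the deletion allele.
    """
    deletion_size = amplicon - alu
    has_insertion = any(abs(b - amplicon) <= tolerance for b in band_sizes)
    has_deletion = any(abs(b - deletion_size) <= tolerance for b in band_sizes)
    if has_insertion and has_deletion:
        return "I/D"
    if has_insertion:
        return "I/I"
    if has_deletion:
        return "D/D"
    return "no call"


heterozygote = call_alu_genotype([610, 295])
insertion_homozygote = call_alu_genotype([590])
deletion_homozygote = call_alu_genotype([310])
failed_lane = call_alu_genotype([])
```

A pooled lane would instead show both bands whenever the polymorphism is
present in any of the pooled samples, so this call applies to individual lanes
only.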

To find TFC SNPs, the SNP database (dbSNP or the Celera SNP
database) is stored as a large file on a computer and then compared to the file
of TFCs currently available from Washington University. SNPs in the TFCs
are obtained by simply overlaying the TFC database on the SNP database by
computer. A desktop Pentium IV computer with 2 GB RAM and a 75 GB hard
drive running for approximately one week is sufficient for this purpose.
Ultra-high throughput SNP typing 

The method described herein requires genotyping each genomic
DNA sample (prepared from whole blood or tissue by standard methods) for
the above approximately 50,000 promoter SNPs and/or approximately
50,000 TFC SNPs in a massively parallel fashion, using as little DNA as
possible. Currently the following methods are available:

(i) microarray ("chip") technology whereby the 50,000 SNPs are
covalently linked to a glass slide, glass bead, or other firm support ("chip")
and each SNP typed by simple hybridization or the combination of
hybridization plus an enzymatic reaction, e.g. primer extension. These
methods currently use as little as 0.1 ng genomic DNA which is amplified by
multiplex PCR for every SNP on the glass slide, and the SNPs are detected
for both the (+) and (-) strand;

(ii) massively parallel SNP typing, although still one SNP at a time,
e.g. by Pyrosequencing, which can accurately type 1 ng of genomic DNA (or as
little as 0.1 ng in pooled samples; up to 100 samples can be pooled for allele
frequency, but not individual genotype frequency, data). Mass spectroscopy is
another accurate method of SNP typing which is currently available, but it
requires more than 0.1 ng of template genomic DNA.

Any of the methods using the latest in SNP-typing technology for the
highest throughput, least expensive, yet accurate SNP-typing, can be utilized.
DNAprint genomics in Sarasota, Florida, for example, can currently type 12
SNPs per 384 well plate using an Orchid Biosciences UHT-SNPstream
machine for $0.40 a SNP.

Statistical Approaches to Microarray SNP Typing

The statistical problem of correcting for multiple comparisons has
been alluded to above. The Bonferroni correction is particularly harsh: 10^4
SNP-typings would require a p value of 10^-8 for any association to reach
significance at the 10^-4 level. Computationally intensive statistical methods
developed by Jurg Ott (Ott J, Hoh J. Am J Hum Genet. 2000
Aug;67(2):289-94. PMID: 10884361) indicate that such high levels are not
necessary. In essence, all of the SNP typings on a given microarray ("chip")
are treated as a single sum, and a nested bootstrap method is used to identify
those allele and genotype differences between cases and controls which are
most significant statistically, without the need for a multiple-assay correction
method.
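
The Bonferroni arithmetic quoted above can be written out as a short Python
sketch. This illustrates only the Bonferroni threshold, not the sum-statistic
or nested bootstrap approach; the function names and the example p values are
illustrative.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test p value required to keep the overall error rate at alpha."""
    return alpha / n_tests


def bonferroni_significant(p_values, alpha):
    """Flag each p value that survives Bonferroni correction."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]


# 10^4 SNP typings at an overall level of 10^-4, as in the text.
per_test = bonferroni_threshold(1e-4, 10**4)

# Illustrative p values for three SNPs tested at an overall level of 0.05.
flags = bonferroni_significant([0.001, 0.02, 0.04], alpha=0.05)
```

In the three-SNP illustration only the first p value survives, since the
per-test threshold becomes 0.05 / 3.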

A more objective but more computationally intensive approach has
also been devised recently (Ritchie et al. Am J Hum Genet. 2001
Jul;69(1):138-47. PMID: 11404819).

Avoiding False Positive Associations due to Population Stratification 

Perhaps the most serious shortcoming of case-control studies is the
difficulty of matching cases and controls. When cases and controls are not
matched for ethnicity, then allele frequencies which differ solely due to
population stratification can look like disease-associated differences instead.
Schork has suggested a way to correct for population stratification using
neutral loci spread throughout the genome, e.g. two per chromosome
(Schork, et al. Adv Genet. 2001;42:191-212. PMID: 11037322).
Mitochondrial and Y chromosome loci can also be used, as in human
population genetics. An average ratio of allele frequencies (case/control) is
determined from at least 30 such neutral marker loci, e.g. 1.05. Allele
differences at all other loci (i.e. for putative functional, regulatory SNPs) are
corrected by this factor. For example, if the frequency of a given allele was
48% among cases and 32% among controls, the corrected allele frequency
among cases would be 48/1.05 = 45.7%. This latter value would be
compared to the control group allele frequency of 32%.
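
The correction described above can be sketched in Python as follows. The
function names and the neutral-locus frequencies are hypothetical; the 30
identical neutral loci are chosen only so that the average ratio equals the
1.05 used in the worked example.

```python
import math


def stratification_factor(neutral_loci):
    """Average case/control allele-frequency ratio over neutral marker loci.

    neutral_loci: list of (case_frequency, control_frequency) pairs.
    """
    ratios = [case / control for case, control in neutral_loci]
    return sum(ratios) / len(ratios)


def corrected_case_frequency(case_frequency, factor):
    """Case allele frequency after dividing out the stratification factor."""
    return case_frequency / factor


# Hypothetical data: 30 neutral loci, each with a case/control ratio of 1.05.
neutral = [(0.21, 0.20)] * 30
factor = stratification_factor(neutral)

# Worked example from the text: 48% among cases, 32% among controls.
corrected = corrected_case_frequency(0.48, factor)
```

The corrected value of about 45.7% is then compared with the control
frequency of 32% in the usual way.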

The yield of mitochondrial DNA can be increased, if necessary, by
using a second, higher-speed centrifugation after low-speed pelleting of
leukocyte nuclei during preparation of DNA from whole blood or tissue
specimens.

Several examples of disease-associated promoter and TFC SNPs,
culled from the literature, follow.
Both Promoter and TFC Overlap 
1. PDGF-A chain 

Platelet-derived growth factor A chain contains two experimentally
verified transcription factor binding sites in the 5' untranscribed region
which are also present in a TFC (States, et al (2000) "Identifying Clusters of
Transcription Factor Binding Sites in the Human Genome" (under review);
Wingender, et al. Nucleic Acids Res. 28, 316-319 (2000); Gashler, et al. Proc
Natl Acad Sci USA. (1992) 89(22): 10984-8. PMID: 1332065). The
sequence from position 853 to 861 according to GenBank Accession Number
S62078 is predicted to bind the SP1_Q6 transcription factor (nomenclature
according to TRANSFAC); the sequence from position 873 to 886 is
predicted to bind the general transcription factor GC1.

A TFC is predicted to stretch from position 27 to position 3830 
according to GenBank Accession Number S62078, thus containing both 
25 experimentally verified transcription factor binding sites. 
Promoter is Explanatory, TFCs are Not 
1. Apolipoprotein E 

Perhaps the best example of a promoter rather than TFC SNP being 
disease-associated is the association of a SNP in the 5' untranscribed region 
of the apolipoprotein E (Apo E) gene with Alzheimer's disease (Roks, et al. 
Neurosci Lett. (1998) 258(2):65-8. PMID: 9875528). The -491 A->T SNP 
in the Apo E gene, relative to the start of transcription, corresponds to A560T 
according to GenBank Accession Number AF261279. Although strongly 
associated with Alzheimer's disease, this SNP does not occur in a TFC. The 
Apo E gene has two TFCs: the closest to this SNP runs from position 1818 
to 1963 according to GenBank Accession Number AF261279, and so is 1258 
nucleotides distant. The second TFC extends from position 3851 to 4541 
according to GenBank Accession Number AF261279. 

Thus, this disease-associated SNP resides in the promoter of Apo E 
but is at least 1200 bases away from the nearest TFC. 
2. UDP-glucuronosyltransferase I (Gilbert's syndrome) 
Gilbert's syndrome was recently discovered (Bosma, et al. N Engl J 
Med. (1995) 333(18):1171-5; PMID: 7565971) to result from disruption of 
the TATA box in the UDP-glucuronosyltransferase I gene when a (TA)6 
repeat is miscopied to become a (TA)7 repeat (positions 3141 to 3150 
according to GenBank Accession Number D87674). This gene does not have 
a TFC. This example illustrates that there are several levels of transcriptional 
control, and that disruption of the RNA polymerase II binding site by an 
extra (TA) dinucleotide can also reduce the level of gene transcription in the 
absence of control by a TFC. 
TFCs are Explanatory, Promoter is Not 
1. Dopamine D2 receptor 

Two SNPs illustrate the significance of the TFC. An insertion of a C 
at position -141 relative to the transcription start site (position 6181 insertion 
C in GenBank Accession Number AF148806; refs. Ohara, et al. Psychiatry 
Res. (1998) 81(2):117-23. PMID: 9858029; Arinami, et al. Hum Mol Genet. 
1997 6(4):577-82. PMID: 9097961) is associated with higher protein (and/or 
mRNA) levels of the dopamine D2 receptor. A transition further upstream 
(i.e. 5'), namely the substitution of a G for an A at position -241 relative to 
the transcription start site (A6081G according to GenBank Accession 
Number AF148806), has no effect on dopamine D2 receptor levels. That is, 
the A6081G SNP is neutral. 

Both SNPs lie within 250 bases upstream of the transcription start 
site. Yet only the 6181insC SNP lies in the TFC for the dopamine D2 
receptor gene. The TFC for this gene runs from position 6120 to position 
6636 (according to GenBank Accession Number AF148806). The 6181insC 
polymorphism is located between an NF-kappaB 50 binding site (at position 
6162 to 6171) and a Pax5_01 binding site at position 6195 to 6222. The 
A6081G SNP lies upstream of the beginning of the TFC. 

It is powerful evidence of the significance of the TFC for gene 
expression that a SNP which lies within the TFC affects gene expression, but 
a SNP which lies only 39 bases away (6120-6081) makes no difference to 
gene expression. 
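
The positional reasoning used in these examples (is a SNP inside a TFC, and if not, how far away is it?) can be expressed as a small Python sketch. The helper names are hypothetical; the coordinates are the GenBank positions quoted in the text for the dopamine D2 receptor (AF148806) and Apo E (AF261279).

```python
def in_region(pos, start, end):
    """True if a 1-based position lies inside the closed interval [start, end]."""
    return start <= pos <= end

def distance_to_region(pos, start, end):
    """Number of bases between a position and the nearest edge of a region (0 if inside)."""
    if pos < start:
        return start - pos
    if pos > end:
        return pos - end
    return 0

# Dopamine D2 receptor TFC and the two SNPs discussed above.
drd2_tfc = (6120, 6636)
drd2_snps = {"6181insC": 6181, "A6081G": 6081}

for name, pos in drd2_snps.items():
    inside = in_region(pos, *drd2_tfc)
    dist = distance_to_region(pos, *drd2_tfc)
    print(f"{name}: inside TFC = {inside}, distance to TFC = {dist}")

# Apo E: SNP A560T against the nearer TFC (positions 1818 to 1963).
apoe_distance = distance_to_region(560, 1818, 1963)
print(f"Apo E A560T distance to nearest TFC = {apoe_distance}")
```

Closed intervals are assumed here because the text gives TFC boundaries as inclusive start and end positions.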
2. Manganese-superoxide dismutase (Mn-SOD) 

Two SNPs in the Mn-SOD gene have been located using tumor DNA 
(fibrosarcomas, Xu, et al. Oncogene. 1999 Jan 7;18(1):93-102. PMID: 
9926924). Both SNPs result in decreased mRNA levels: -102C->T relative 
to the transcription start site (C681T according to GenBank Accession 
Number S77127), and -38C->G relative to the start of transcription (C745G 
according to GenBank Accession Number S77127). The C681T 
polymorphism results in decreased binding by Sp1; the C745G 
polymorphism results in decreased binding by AP-2. Both are widely used 
transcription factors. 

The TFC for the Mn-SOD gene runs from position 426 to position 
1139 according to GenBank Accession Number S77127. The C681T 
polymorphism disrupts a binding site for SP1_Q6 between positions 669 and 
681 on the (+) strand, using the terminology of TRANSFAC and Genomatix 
software to predict transcription factor binding sites. The C745G 
polymorphism disrupts the potential binding site for MZF1_01 on the (-) 
strand; the experimental finding of decreased binding by AP-2 was not 
predicted by the Genomatix software. 
3. Beta-globin locus control region (LCR). 

The beta-globin LCR is a region of about 8,000 base pairs that 
controls expression of the beta-globin gene even though it is located 65,000 
base pairs away from it. Experimental evidence indicates that an HS-2 site is 
required for expression of beta-globin (Cooper, et al. Ann Med. 1992 
Dec;24(6):427-37. PMID: 1283065). The sequence for the beta-globin LCR 
is contained in GenBank Accession Number AF064190. This sequence 
contains a TFC spanning positions 2840 to 3119, consistent with this 
region's being important in gene regulation. 
4. Psoriasin (S100A7 gene) 

Psoriasin, or the S100A7 gene, was recently sequenced. Two 
polymorphisms in the 5' region of the gene were discovered (Semprini, et al. 
Hum Genet. 1999 Feb;104(2):130-4. PMID: 10190323): -559G->A relative 
to the transcription start site (G195A according to GenBank Accession 
Number AF050167), and -563A->G relative to the transcription start site 
(A191G according to GenBank Accession Number AF050167). Although 
located in the 5' region of a candidate gene for psoriasis, neither SNP was 
found to be associated with the disease. 

TFC analysis of the psoriasin gene reveals the potential reason: 
psoriasin does not contain a TFC. This example suggests that a SNP within a 
TFC is more important for gene regulation than a SNP within the promoter 
(5' untranscribed region). 

5. C-myc 

C-myc is a proto-oncogene in which a SNP has been identified in 
exon 1 (C->T at position 2756 according to GenBank Accession Number 
J00120) [A mutation in the c-myc-IRES leads to enhanced internal ribosome 
entry in multiple myeloma: a novel mechanism of oncogene de-regulation. 
Oncogene. 2000 Sep 7;19(38):4437-40. PMID: 10980620]. Although this 
SNP has been claimed to disrupt an Internal Ribosome Entry Sequence 
(IRES) with an effect on translation of the messenger RNA for c-myc, it also 
disrupts a PAX5_02 transcription factor binding site in the TFC predicted for 
c-myc. This SNP may well have important disease associations, but would 
not be considered if only promoter (5' untranscribed region) SNPs were 
examined. 

Finding Disease-Associated SNPs: Strategy 

1. Identify regulatory SNPs throughout the genome. 




This method's competitive advantage lies in the power of 
bioinformatics. Rather than pursue coding sequence SNPs ("cSNPs"), this 
method focuses on the relatively unexplored depths of non-coding DNA. But 
the goal will remain whole genome coverage. Regulatory region SNPs will 
be identified in every gene. 

Chips will be assembled in the following order: 

Transcription factor cluster (TFC) SNPs (chip #1); 

5' ("promoter") region SNPs (chip #2). 

SNPs will first be derived from the public database (dbSNP). If 
neither chip #1 nor chip #2, using publicly available SNPs, is sufficient to 
find disease-associated SNPs with sufficient statistical significance, then 
additional SNPs will be added. The strategy will be to use the smallest 
number of chips which can net 5 to 10 different genes per disease, assuming 
that perhaps 20 genes may actually be involved in each disease. It is 
impractical to identify more than a dozen new drug targets for each disease, 
given the cost of new drug development and the limited number of research 
pharmaceutical companies. 

The first approach to finding additional SNPs will be computational. 
An additional 500 nucleotides will be added to both the 5' and 3' ends of 
each TFC and promoter, and this wider net used to troll for additional SNPs. 
These SNPs are expected to be in linkage disequilibrium with the TFC or 5' 
or 3' region in question, which makes it possible to include these regions 
without the need to do additional SNP discovery. These additional SNPs 
will make up chip #1a and chip #2a. 
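
A minimal sketch of this computational step, assuming regions and SNPs are given as 1-based positions on the same reference sequence. The function names are hypothetical; the example region is the dopamine D2 receptor TFC quoted earlier, and the SNP positions other than 6081 and 6181 are made up for illustration.

```python
FLANK = 500  # nucleotides added to both the 5' and 3' ends, as described above

def widen(region, flank=FLANK):
    """Extend a (start, end) region by `flank` bases on each side, clamped at position 1."""
    start, end = region
    return (max(1, start - flank), end + flank)

def flanking_snps(region, snp_positions, flank=FLANK):
    """SNPs caught by the wider net but lying outside the original region."""
    start, end = region
    w_start, w_end = widen(region, flank)
    return [p for p in snp_positions
            if w_start <= p <= w_end and not (start <= p <= end)]

tfc = (6120, 6636)
snp_positions = [5500, 6081, 6181, 7100]  # 5500 and 7100 are hypothetical

print(widen(tfc))
print(flanking_snps(tfc, snp_positions))
```

SNPs already inside the original region are excluded so that the result corresponds only to the additions intended for chips #1a and #2a.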

If use of the additional SNPs derived computationally is still 
insufficient to find strongly disease-associated SNPs, then selected TFC and 
promoter regions will be amplified and sequenced directly to find SNPs. 
SNPs obtained by direct sequencing of TFCs will constitute chip #1c; 
promoter SNPs obtained by sequencing will make up chip #2c. Thirty 
samples are pooled, and SNPs are used whose peak height exceeds 20% of the 
majority peak [Marth, et al. Nat Genet. 1999 Dec;23(4):452-6]. 
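
The 20% peak-height rule can be sketched as follows. This is only an illustration under stated assumptions: the input format (a mapping from sequence position to per-base peak heights from a pooled trace) and the function name are hypothetical, and real base-calling pipelines involve considerably more signal processing.

```python
def candidate_snps(peaks, threshold=0.20):
    """Return positions whose second-highest peak exceeds `threshold` of the majority peak.

    peaks: dict mapping position -> dict mapping base -> peak height.
    Returns: dict mapping position -> (major base, minor base, minor/major ratio).
    """
    hits = {}
    for pos, heights in peaks.items():
        ranked = sorted(heights.items(), key=lambda kv: kv[1], reverse=True)
        if len(ranked) < 2:
            continue  # only one base observed; no secondary peak to evaluate
        (major, major_h), (minor, minor_h) = ranked[0], ranked[1]
        if major_h > 0 and minor_h / major_h > threshold:
            hits[pos] = (major, minor, minor_h / major_h)
    return hits

# Hypothetical peak heights from a pooled sequencing trace.
pooled_trace = {
    101: {"C": 1000, "T": 300},   # secondary peak at 30% of the majority peak
    202: {"G": 1000, "A": 100},   # secondary peak at 10%; treated as noise
    303: {"A": 900},              # no secondary peak
}
snp_hits = candidate_snps(pooled_trace)
print(snp_hits)
```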
2. Develop the SNP chips 

Start with 100 regulatory region SNPs (either derived from TFCs or 
5' regions). Using control DNA, demonstrate reproducible, reliable 
genotyping at these 100 loci for one dozen different control individuals. 
Next, expand to 6,000-10,000 SNPs (chip #1). Demonstrate 
reproducible SNP-typing for one dozen control samples (i.e. genotype 12 
samples using 6 different chips, then compare the results for each chip). 
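
One way to compare results between chips is a pairwise concordance rate over the genotype calls they share. The sketch below assumes each chip's calls are stored as a mapping from (sample, SNP) to a genotype string, with None for a failed call; this data layout and the function names are assumptions, not part of the described method.

```python
from itertools import combinations

def concordance(calls_a, calls_b):
    """Fraction of shared, non-missing (sample, SNP) calls on which two chips agree."""
    shared = [k for k in calls_a
              if calls_a[k] is not None and calls_b.get(k) is not None]
    if not shared:
        return None
    agree = sum(calls_a[k] == calls_b[k] for k in shared)
    return agree / len(shared)

def pairwise_concordance(chips):
    """Concordance for every pair of chips, keyed by the pair of chip names."""
    return {(a, b): concordance(chips[a], chips[b])
            for a, b in combinations(sorted(chips), 2)}

# Hypothetical calls for two samples and two SNPs on two replicate chips.
chips = {
    "chip_A": {("s1", "snp1"): "AG", ("s1", "snp2"): "CC",
               ("s2", "snp1"): "AA", ("s2", "snp2"): None},
    "chip_B": {("s1", "snp1"): "AG", ("s1", "snp2"): "CT",
               ("s2", "snp1"): "AA", ("s2", "snp2"): "CC"},
}
rates = pairwise_concordance(chips)
print(rates)
```

Failed calls are dropped from the denominator here; a stricter reproducibility measure could instead count them as disagreements.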

Next, set up chip #2. 
2. Using a single disease (e.g. sporadic, non-familial breast cancer in 
American Caucasian women), use chips #1 and #2 to find disease-associated 
SNPs. 

Obtain the samples from a supplier, e.g. the Coriell Cell Repository 
(10 micrograms available for $50, average price), collaborators at the 
National Cancer Institute, etc. 
Ship the samples to the Chip Lab. 

Perform genotyping for chips #1 and #2. 

Transmit data for statistical analysis. 

Perform data analysis. 

Identify disease-associated SNPs. 
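
The text does not name the statistical test used in the data analysis step. As one common choice for case-control allele counts, the sketch below applies a Pearson chi-square test (one degree of freedom, no continuity correction) to a 2x2 allele table; the function name and the counts are illustrative assumptions, and a real analysis would also need correction for multiple testing across thousands of SNPs.

```python
from math import erfc, sqrt

def allele_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square statistic and p-value (1 d.f.) for a 2x2 allele-count table."""
    a, b, c, d = case_alt, case_ref, ctrl_alt, ctrl_ref
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column of the table is empty")
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2.0))
    return chi2, p_value

# Illustrative counts only: 250 cases and 250 controls (500 alleles each),
# with allele frequencies of 48% among cases and 32% among controls.
chi2, p = allele_chi_square(240, 260, 160, 340)
print(f"chi2 = {chi2:.2f}, p = {p:.2e}")
```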
3. Obtain samples from commercially important diseases (Table 1): 

American Caucasians, both men and women, 250 cases each; 

Pick diseases of high commercial value that are not already solved; this requires 
competitive intelligence on NHLBI's Hypertension Genetic Network, as well 
as private sector efforts. 

Use chips #1 and #2, perhaps augmented by additional SNPs, to 
genotype additional diseases. 
Technical Objectives 

1. Collect as many regulatory SNPs as possible into a single database 

A. "Promoter" SNPs, 1-2 kb upstream from the transcription start 
site. This involves standard methods in bioinformatics, as described above. 

B. TFC SNPs, in newly recognized regulatory regions that are 
somewhat analogous to "enhancers". These TFCs are not yet generally accepted 
as regulatory regions. 

C. 3' UTR SNPs that control stability of messenger RNA will be 
collected on a continuous basis from the literature (Medline searches). 

2. Include some neutral but ethnically informative SNPs (from the Y 
chromosome) to ensure that cases and controls are well matched ethnically. 

3. Utilize a genotyping lab. The following are representative: Asper 
Biotechnology, Tartu, Estonia; Orchid Biosciences, Princeton, NJ; 
Sequenom, San Diego (www.sequenom.com); Illumina, San Diego 
(www.illumina.com); Celera (Taqman) (www.celera.com); Gemini 
Genomics (www.gemini-genomics.com); Genomics Collaborative 
(www.getdna.com); Incyte (www.incyte.com); Lynx Therapeutics 
(www.lynxgen.com); Myriad Genetics (www.myriad.com); GeneScan 
(www.genescan.com); GenOdyssee (www.genodyssee.com); Amersham 
Pharmacia Biotech (www.apbiotech.com); Paradigm Genetics 
(www.paragen.com); Promega (www.promega.com); Qiagen Genomics 
(www.qiagen.com). DNA sequencing labs: e.g. MWG-Biotech, 
www.genotype.de, WEHI in Melbourne, Australia; Hyseq (www.hyseq.com) 

4. Get DNA samples, for example, from existing collections, such as the 
Coriell Cell Repository and the Southwest Oncology Group (SWOG); 
Genomics Collaborative (www.getdna.com); DNA Sciences 
(www.dna.com); Gemini Genomics (www.gemini-genomics.com); First 
Genetic Trust (www.firstgenetic.net); Novartis; Bristol-Myers Squibb; Incyte 
(www.incyte.com); and Myriad Genetics (www.myriad.com), or obtain 
samples, for example, from hospital(s). 

The information obtained from these collections of SNPs or "chips" 
can be used for protein prediction and smart-molecule design, empirical drug 
testing, "high throughput screening" companies; toxicology companies; 
animal models/animal studies companies; and drug production. 

The information can also be used for prognostics to predict likelihood 
of developing one or more diseases. 






Construction of a "Health Chip" 

A Promoter SNP is defined as a single nucleotide polymorphism 
within 2 kilobases upstream of the 5'-end of a RefSeq gene. RefSeq consists 
of a highly curated database of approximately 14,000 gene transcripts, 
representing between one-third and one-half of all human genes. It is 
the best available sequence for human genes, and is derived from mRNA and 
EST sequences. A computer system with sufficient local memory (RAM) 
and speed was configured to access and interrogate the relevant public 
databases (see below). 

Each RefSeq sequence was first positioned along the Golden Path 
Assembly (UCSC Human Genome Assembly, version 2001-04-01). The 2 
kilobases upstream of the transcription start site were saved into a new 
database ("Upstream regions"). The "Upstream regions" database was then 
overlaid onto dbSNP, the publicly available SNP database, in order to find 
SNPs specifically in upstream regions of RefSeq genes. 
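
The overlay of upstream regions onto SNP positions can be sketched as below. This is a simplified illustration, not the configuration actually used: the record layouts, gene names and SNP identifiers are hypothetical, coordinates are assumed to be 1-based and inclusive, and the strand handling (taking the 2 kb beyond the transcript end for minus-strand genes) is an assumption about how "upstream" is resolved on the assembly.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

UPSTREAM = 2000  # 2 kilobases upstream of the transcription start site

def upstream_region(gene, length=UPSTREAM):
    """Return (chrom, start, end) of the upstream window for a positioned transcript."""
    if gene["strand"] == "+":
        start, end = gene["tx_start"] - length, gene["tx_start"] - 1
    else:
        start, end = gene["tx_end"] + 1, gene["tx_end"] + length
    return gene["chrom"], max(1, start), end

def promoter_snps(genes, snps):
    """Map each gene name to the SNP ids falling in its upstream window."""
    by_chrom = defaultdict(list)
    for chrom, pos, snp_id in snps:
        by_chrom[chrom].append((pos, snp_id))
    index = {}
    for chrom, entries in by_chrom.items():
        entries.sort()
        index[chrom] = ([p for p, _ in entries], [s for _, s in entries])

    result = {}
    for gene in genes:
        chrom, start, end = upstream_region(gene)
        positions, ids = index.get(chrom, ([], []))
        lo = bisect_left(positions, start)
        hi = bisect_right(positions, end)
        result[gene["name"]] = ids[lo:hi]
    return result

# Hypothetical transcripts and SNPs.
genes = [
    {"name": "GENE_PLUS", "chrom": "chr1", "strand": "+", "tx_start": 10000, "tx_end": 15000},
    {"name": "GENE_MINUS", "chrom": "chr1", "strand": "-", "tx_start": 30000, "tx_end": 35000},
]
snps = [("chr1", 8500, "rsA"), ("chr1", 7999, "rsB"), ("chr1", 10000, "rsC"),
        ("chr1", 36000, "rsD"), ("chr2", 9000, "rsE")]

hits = promoter_snps(genes, snps)
print(hits)
```

Sorting SNP positions once per chromosome and using binary search keeps the overlay fast even when the SNP list is much larger than the gene list.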

This list of promoter SNPs can be used for high-throughput 
genotyping, such as by microarray (e.g. arrayed primer extension, APEX), in 
order to find disease-associated SNPs and genes. Because RefSeq is being 
constantly updated, and will eventually contain the transcripts of all human 
expressed genes, this list of approximately 12,000 Promoter SNPs derived 
from approximately 4,000 genes is referred to as version 1.0 
("HealthChip_1"). It is anticipated that there will be additional, updated 
versions of this list as RefSeq is updated. It is anticipated that there are 
approximately 10 times as many total SNPs, or 120,000 total Promoter 
SNPs. 

Public Databases Interrogated to derive the list of Promoter SNPs 
["Promoter GeneNet" TM applied for] 

1. NCBI RefSeq (version 2001-06-15) 
ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/hs.fna.gz 

2. UCSC Human Genome Assembly (version 2001-04-01) 
http://genome.cse.ucsc.edu/goldenPath/01apr2001/bigZips 

3. NCBI dbSNP (version 2001-08-04) 
ftp://ftp.ncbi.nlm.nih.gov/snp/human/rs_fasta 

Table 1: List of Adult Diseases Whose Associated Genes Can 
Be Found Using This Method 

(Note 1: This List Also Applies to Common, Polygenic Pediatric Diseases, 
e.g. Juvenile RA as well as RA [Rheumatoid Arthritis]) 
(Note 2: Abbreviations are Standard, e.g. CRF= Chronic Renal Failure. The 
numbers given in the columns to the right apply to possible sample numbers 
from different collections) 

(Note 3: The most common, non-redundant diagnoses are numbered 1-222). 
39. BDR                        265 
40. Pre-proliferative          49 
41. Proliferative              68 
42. DME or CSDME               91 
43. S/p laser photocoag.       121 
NIDDM Neuropathy 
    Yes (NOS)                  134 
44. Autonomic                  33 
46. Gastroparesis              70 
47. Neurogenic bladder         24 
54. Paget's disease            9 
55. Osteoporosis               16 
56. Renal osteodystrophy       21 
Lipid disorders 
57. Chol>250, TG<200           192 
58. Chol<200, TG>300           51 
59. Chol>250, TG>300           99 
"Hypercholesterolemia"         93 0 
I claim: 

1. A method of identifying disease-specific polymorphisms comprising 
screening non-coding nucleotide sequence selected from the group 
consisting of non-coding nucleotide sequence three kilobases upstream of the 
5' start site of protein encoding sequences and non-coding intergenomic 
sequences, for polymorphisms. 

2. The method of claim 1 wherein the protein encoding sequences are 
associated with a disease or disorder. 

3. The method of claim 1 further comprising comparing transcription 
factor clusters in the sequences and identifying single nucleotide 
polymorphisms within these clusters. 

4. The method of claim 1 comprising screening for Alu sequences in the 
non-coding sequences. 

5. The method of claim 4 wherein the Alu sequences form tRNA-like 
structures. 

6. The method of claim 1 comprising identifying single nucleotide 
polymorphisms in the promoter region of a protein encoding sequence. 

7. The method of claim 2 comprising identifying the disease or disorder 
associated gene that is regulated by the single nucleotide polymorphism 
harboring sequence and deducing that the gene product or an abnormal level 
of the product is involved in the disease or disorder. 

8. The method of claim 1 wherein the analysis is carried out with the 
sequences available in publicly available databases. 

9. The method of claim 8 wherein the sequences are associated with 
genes associated with hypertension and endocrinology. 

10. The method of claim 8 wherein the sequences contain single 
nucleotide polymorphisms in the promoter regions. 

11. A microarray or chip comprising a plurality of non-coding nucleotide 
sequences selected from the group consisting of non-coding nucleotide 
sequence three kilobases upstream of the 5' start site of protein encoding 
sequences and non-coding intergenomic sequences, wherein the nucleotide 
sequences comprise polymorphisms. 
12. The microarray of claim 11 wherein the protein encoding sequences 
are associated with a disease or disorder. 

13. The microarray of claim 11 wherein the nucleotide sequences 
comprise transcription factor clusters. 

14. The microarray of claim 13 wherein the transcription factor clusters 
comprise single nucleotide polymorphisms. 

15. The microarray of claim 11 wherein the sequences comprise Alu 
sequences in the non-coding sequences. 

16. The microarray of claim 15 wherein the Alu sequences form tRNA-like 
structures. 

17. The microarray of claim 11 comprising protein encoding sequences 
comprising single nucleotide polymorphisms in the promoter region of a 
protein encoding sequence. 

18. The microarray of claim 11 comprising sequences known to be 
associated with a disease or disorder. 

19. The microarray of claim 11 comprising control sequences not 
associated with a disease or disorder. 